In this tutorial, we analyze sample N1 from a dataset of tumor and immune cell populations from human early-stage lung adenocarcinomas (He et al., Oncogene 2021). The data were generated with the 10X Genomics platform. Sample N1 comprises 14,972 single cells that were sequenced on the Illumina HiSeq X. Raw data can be downloaded from here. Prior to the analysis with PoPsicleR, the raw data were processed with Cell Ranger to align reads and generate feature-barcode matrices. The Cell Ranger output, comprising the gzipped feature-barcode matrix and the gzipped TSV files of features and barcodes, can be found here.
We start by assigning a name to the sample (i.e., N1) and defining the paths to the folder containing the raw input data (e.g., cellranger_data) and to the folder where Seurat output objects (e.g., RDS objects) will be saved. For its preprocessing, PoPsicleR takes the path to the folder of raw input data as input, rather than a single matrix file (see Supplementary data of Grandi et al., 2021 for additional details on the structure of data folders).
library(popsicle)
## Hello and welcome to PoPsicle, an interactive pipeline for the preprocessing of single cell data.
## In this workflow, messages are colour coded:
## green messages will provide information on the ongoing step
## cyan messages will require the user to provide an input for advancing the script
## red messages will report missing information or wrong inputs
## grey messages report functions and software system outputs.
# set the working directory and sample name
sample.name <- "N1"
main.dir <- file.path("~", "PoPsicleExample", sample.name)
input.data.dir <- file.path(main.dir, "cellranger_data")
output.data.dir <- file.path(main.dir, "RDS objects")
# create the output folder (and any missing parent folders) if it does not exist
if (!dir.exists(output.data.dir)) {
  dir.create(output.data.dir, recursive = TRUE)
}
setwd(main.dir)
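Before running the preprocessing, it can help to confirm that the input folder holds the files Cell Ranger writes. The following is a minimal sketch in base R, not part of PoPsicleR: the helper name missing_10x_files is hypothetical, and it assumes the three gzipped files produced by Cell Ranger v3 or later sit directly inside the folder. The layout PoPsicleR actually expects is described in Grandi et al., 2021 and may differ.

```r
# Hypothetical helper (not a PoPsicleR function): return the names of the
# gzipped Cell Ranger (v3+) files that are missing from a folder.
# Assumption: the files sit directly inside data.dir.
missing_10x_files <- function(data.dir) {
  expected <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
  expected[!file.exists(file.path(data.dir, expected))]
}

# Demonstration on a temporary folder, so the sketch runs without the N1 data
demo.dir <- file.path(tempdir(), "cellranger_data_demo")
dir.create(demo.dir, showWarnings = FALSE)
file.create(file.path(demo.dir, c("barcodes.tsv.gz", "features.tsv.gz")))

# Only the file that was not created is reported as missing
missing_10x_files(demo.dir)
```

In the tutorial setting, calling missing_10x_files(input.data.dir) would return an empty character vector when all three files are in place.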
We start by reading in the data. The Read10X() function from Seurat reads the output of the Cell Ranger pipeline from 10X and returns a unique molecular identifier (UMI) count matrix. Each value in this matrix is the number of molecules of a feature (i.e., gene; row) detected in a cell (column).
We next use the count matrix to create a Seurat object. The object is a container for both the data (such as the count matrix) and the analysis results (such as PCA or clustering) of a single-cell dataset. For a technical discussion of the Seurat object structure, see the Seurat GitHub Wiki. For example, the count matrix of the object created below is stored in sample.umi[["RNA"]]@counts.
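To make the matrix layout concrete, here is a small base R sketch of a UMI count matrix with genes in rows and cells in columns, together with the per-cell quantities that QC plots of this kind typically summarize (total UMIs, detected genes, and mitochondrial fraction). The counts are invented for illustration, and the code only shows the definitions under the assumption that mitochondrial genes are identified by the "MT-" prefix. It is not how PoPsicleR or Seurat compute these metrics internally.

```r
# Toy UMI count matrix: genes in rows, cells in columns (invented values)
counts <- matrix(
  c(5, 0, 2,
    0, 3, 1,
    4, 1, 0,
    1, 0, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("PTPRC", "CD3D", "MT-CO1", "RPL13"),
                  c("cell1", "cell2", "cell3"))
)

n.umi  <- colSums(counts)      # total molecules detected per cell
n.gene <- colSums(counts > 0)  # number of genes with at least one molecule

# Assumption: mitochondrial genes are those whose symbol starts with "MT-"
mt.rows <- grepl("^MT-", rownames(counts))
pct.mt  <- 100 * colSums(counts[mt.rows, , drop = FALSE]) / n.umi

# One row per cell, one column per QC metric
qc <- data.frame(nUMI = n.umi, nGene = n.gene, percent.mt = pct.mt)
qc
```

Here cell1 has 10 UMIs across 3 detected genes, 4 of which come from the mitochondrial gene, giving a mitochondrial percentage of 40.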
# Define a list of cell-type markers
populations.markers <- c("PTPRC", "CD34", "PROM1", "CD3D", "CD3E", "TNFRSF4", "CD4", "IL7R")
# Create a Seurat object from the raw data and generate QC plots
sample.umi <- PrePlots(sample = sample.name, input_data = input.data.dir, genelist = populations.markers)
##
## Plotting QC Violin plots
## Plots saved in: 02.QC_Plots\02a_violin_plots.pdf
## Plotting QC Density plots
## Plots saved in: 02.QC_Plots\02b_QC_Hist_nGene_nUMI_MTf_Ribo.pdf
## Plotting QC Scatter plots
## Plots saved in: 02.QC_Plots\02c_QC_Scatter_nGene_nUMI_MTf.pdf
## Plotting QC per gene Histograms
## Plots saved in: 02.QC_Plots\02d_QC_Hist_Check.pdf
## Plotting QC per gene Scatter plots
## Plots saved in: 02.QC_Plots\02e_QC_Scatter_Check.pdf
##
## Now check the graphs, choose your thresholds and then run FilterPlots